pip install gap-stat
The dataset captures transactions from December 2009 to December 2011 from a UK-based online retailer. Fields include InvoiceNo, StockCode, Description, Quantity, InvoiceDate, UnitPrice, CustomerID, and Country.
The project is structured into two main parts:
1. Exploratory Data Analysis (EDA) and extraction of business insights
2. Customer segmentation using RFM analysis
This comprehensive approach will help uncover valuable insights into product popularity, sales trends, and customer segmentation, providing a holistic understanding of the dataset.
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import requests
import plotly.express as px
import missingno as msno
import warnings
import scipy.stats as stats
from io import StringIO
from sklearn.preprocessing import LabelEncoder
from datetime import datetime
from tqdm import tqdm
label_encoder = LabelEncoder()
pd.set_option('display.max_columns', None)
pd.set_option('display.max_rows', None)
warnings.filterwarnings('ignore')
data1=pd.read_csv('Online Retail.csv')
data2=pd.read_csv('online_retail_II.csv')
display(data1.head())
| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|---|---|
| 0 | 536365 | 85123A | WHITE HANGING HEART T-LIGHT HOLDER | 6 | 12/1/2010 8:26 | 2.55 | 17850.0 | United Kingdom |
| 1 | 536365 | 71053 | WHITE METAL LANTERN | 6 | 12/1/2010 8:26 | 3.39 | 17850.0 | United Kingdom |
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 12/1/2010 8:26 | 2.75 | 17850.0 | United Kingdom |
| 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 12/1/2010 8:26 | 3.39 | 17850.0 | United Kingdom |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 12/1/2010 8:26 | 3.39 | 17850.0 | United Kingdom |
display(data2.head())
| | Invoice | StockCode | Description | Quantity | InvoiceDate | Price | Customer ID | Country |
|---|---|---|---|---|---|---|---|---|
| 0 | 489434 | 85048 | 15CM CHRISTMAS GLASS BALL 20 LIGHTS | 12 | 12/1/2009 7:45 | 6.95 | 13085.0 | United Kingdom |
| 1 | 489434 | 79323P | PINK CHERRY LIGHTS | 12 | 12/1/2009 7:45 | 6.75 | 13085.0 | United Kingdom |
| 2 | 489434 | 79323W | WHITE CHERRY LIGHTS | 12 | 12/1/2009 7:45 | 6.75 | 13085.0 | United Kingdom |
| 3 | 489434 | 22041 | RECORD FRAME 7" SINGLE SIZE | 48 | 12/1/2009 7:45 | 2.10 | 13085.0 | United Kingdom |
| 4 | 489434 | 21232 | STRAWBERRY CERAMIC TRINKET BOX | 24 | 12/1/2009 7:45 | 1.25 | 13085.0 | United Kingdom |
print(f'''Data1 Column names
{data1.columns}
-------------------------
Data2 Column names:
{data2.columns}''')
Data1 Column names
Index(['InvoiceNo', 'StockCode', 'Description', 'Quantity', 'InvoiceDate',
'UnitPrice', 'CustomerID', 'Country'],
dtype='object')
-------------------------
Data2 Column names:
Index(['Invoice', 'StockCode', 'Description', 'Quantity', 'InvoiceDate',
'Price', 'Customer ID', 'Country'],
dtype='object')
# Renaming Column names
column_mapping = {
'Invoice': 'InvoiceNo',
'Price': 'UnitPrice',
'Customer ID': 'CustomerID',
}
data2.rename(columns=column_mapping, inplace=True)
print(data2.columns)
Index(['InvoiceNo', 'StockCode', 'Description', 'Quantity', 'InvoiceDate',
'UnitPrice', 'CustomerID', 'Country'],
dtype='object')
data = pd.merge(data1, data2, how='outer')
print(f'''Data1 Shape:{data1.shape},
Data2 Shape:{data2.shape},
Data Shape:{data.shape}''')
Data1 Shape:(541909, 8), Data2 Shape:(525461, 8), Data Shape:(1045545, 8)
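The combined frame has fewer rows than the two inputs together (1,045,545 vs 1,067,370) because `pd.merge` with `how='outer'` and no key joins on all shared columns, collapsing rows that are identical across the two frames. A minimal sketch on toy frames (hypothetical data, not from the dataset) contrasts this with `pd.concat`, which keeps every row until duplicates are dropped explicitly:

```python
import pandas as pd

# Toy frames (hypothetical) sharing one identical row.
a = pd.DataFrame({'InvoiceNo': ['1', '2'], 'Quantity': [5, 3]})
b = pd.DataFrame({'InvoiceNo': ['2', '3'], 'Quantity': [3, 7]})

# Outer merge on all common columns collapses rows that match exactly.
merged = pd.merge(a, b, how='outer')

# concat keeps every row; duplicates must be dropped explicitly.
stacked = pd.concat([a, b], ignore_index=True)
deduped = stacked.drop_duplicates(ignore_index=True)

print(len(merged), len(stacked), len(deduped))  # 3 4 3
```

Whether the overlapping rows are true duplicates or coincidentally identical transactions is a judgment call; the merge approach assumes the former.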
def table_summary(data):
summary_data = []
for column in data.columns:
num_unique_values = data[column].nunique()
num_empty_values = data[column].isnull().sum()
data_type = data[column].dtypes
percentage_empty = (num_empty_values / len(data)) * 100
summary_data.append([column, num_unique_values, num_empty_values, percentage_empty, data_type])
summary_df = pd.DataFrame(summary_data, columns=['Column Name', 'Unique Values', 'Empty Values', 'Percentage Empty', 'Data Type'])
print(f'''Table Summary
Dataframe Shape: {data.shape}''')
display(summary_df)
msno.matrix(data)
plt.title('Missing Values Overview', fontsize=16)
plt.xlabel('Columns', fontsize=14)
plt.ylabel('Percentage of Missing Values', fontsize=14)
plt.show()
table_summary(data)
Table Summary Dataframe Shape: (1045545, 8)
| | Column Name | Unique Values | Empty Values | Percentage Empty | Data Type |
|---|---|---|---|---|---|
| 0 | InvoiceNo | 53628 | 0 | 0.000000 | object |
| 1 | StockCode | 5305 | 0 | 0.000000 | object |
| 2 | Description | 5698 | 4275 | 0.408878 | object |
| 3 | Quantity | 1057 | 0 | 0.000000 | int64 |
| 4 | InvoiceDate | 47635 | 0 | 0.000000 | object |
| 5 | UnitPrice | 2807 | 0 | 0.000000 | float64 |
| 6 | CustomerID | 5942 | 235289 | 22.503957 | float64 |
| 7 | Country | 43 | 0 | 0.000000 | object |
We have missing values in the Description column. First, I'll check whether these rows carry valid StockCode entries and whether those StockCodes have a description elsewhere in the data.
desc_nan=data[data['Description'].isnull()]
display(desc_nan.head())
| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|---|---|
| 658 | 536414 | 22139 | NaN | 56 | 12/1/2010 11:52 | 0.0 | NaN | United Kingdom |
| 2046 | 536545 | 21134 | NaN | 1 | 12/1/2010 14:32 | 0.0 | NaN | United Kingdom |
| 2047 | 536546 | 22145 | NaN | 1 | 12/1/2010 14:33 | 0.0 | NaN | United Kingdom |
| 2048 | 536547 | 37509 | NaN | 1 | 12/1/2010 14:33 | 0.0 | NaN | United Kingdom |
| 2063 | 536549 | 85226A | NaN | 1 | 12/1/2010 14:34 | 0.0 | NaN | United Kingdom |
# getting the list of unique StockCode from desc_nan dataset
stockcodes_list = list(desc_nan['StockCode'].unique())
# filtering the 'data' containing only the StockCode from stockcodes_list
mask = data['StockCode'].isin(stockcodes_list)
filtered_data = data[mask]
display(filtered_data.head())
| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country |
|---|---|---|---|---|---|---|---|---|
| 2 | 536365 | 84406B | CREAM CUPID HEARTS COAT HANGER | 8 | 12/1/2010 8:26 | 2.75 | 17850.0 | United Kingdom |
| 3 | 536365 | 84029G | KNITTED UNION FLAG HOT WATER BOTTLE | 6 | 12/1/2010 8:26 | 3.39 | 17850.0 | United Kingdom |
| 4 | 536365 | 84029E | RED WOOLLY HOTTIE WHITE HEART. | 6 | 12/1/2010 8:26 | 3.39 | 17850.0 | United Kingdom |
| 5 | 536365 | 22752 | SET 7 BABUSHKA NESTING BOXES | 2 | 12/1/2010 8:26 | 7.65 | 17850.0 | United Kingdom |
| 6 | 536365 | 21730 | GLASS STAR FROSTED T-LIGHT HOLDER | 6 | 12/1/2010 8:26 | 4.25 | 17850.0 | United Kingdom |
filtered_data.shape
(375540, 8)
Fill missing values in the 'Description' column by replacing them with the available descriptions associated with the respective 'StockCode' if such descriptions exist.
# Identify unique StockCodes where Description is NaN
missing_description_stockcodes = desc_nan['StockCode'].unique()
# go through each StockCodes
for stockcode in tqdm(missing_description_stockcodes):
# Check if there are non-null Descriptions for the current StockCode
non_null_descriptions = data[data['StockCode'] == stockcode]['Description'].dropna()
if not non_null_descriptions.empty:
# If non-null Descriptions exist, fill in missing values with the first non-null Description
fill_value = non_null_descriptions.iloc[0]
condition = (data['StockCode'] == stockcode) & (data['Description'].isnull())
data.loc[condition, 'Description'] = fill_value
100%|███████████████████████████████████████| 2451/2451 [03:04<00:00, 13.32it/s]
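The loop above rescans the full frame once per StockCode, which is why it takes a few minutes. A vectorized alternative (a sketch on hypothetical rows, assuming the same first-non-null fill rule) uses `GroupBy.transform('first')`, which broadcasts each group's first non-null value back to every row of the group:

```python
import pandas as pd
import numpy as np

# Hypothetical mini-frame mirroring the StockCode/Description layout.
df = pd.DataFrame({
    'StockCode': ['A', 'A', 'B', 'B'],
    'Description': ['WIDGET', np.nan, np.nan, 'GADGET'],
})

# transform('first') broadcasts each group's first non-null Description,
# so fillna patches only the missing entries in a single pass.
df['Description'] = df['Description'].fillna(
    df.groupby('StockCode')['Description'].transform('first')
)
print(df['Description'].tolist())  # ['WIDGET', 'WIDGET', 'GADGET', 'GADGET']
```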
def count_unique_values(data, columns):
summary_data = []
for column in columns:
num_unique_values = data[column].nunique()
summary_data.append([column, num_unique_values])
summary_df = pd.DataFrame(summary_data, columns=['Column Name', 'Unique Values'])
display(summary_df)
count_unique_values(data,['StockCode','Description'])
| | Column Name | Unique Values |
|---|---|---|
| 0 | StockCode | 5305 |
| 1 | Description | 5698 |
Counting the number of unique Descriptions each StockCode has.
# Group by 'StockCode' and count the unique 'Description' values
stockcode_description_counts = data.groupby('StockCode')['Description'].nunique()
# display(stockcode_description_counts)
# Filter StockCodes with more than one unique Description
stockcodes_with_multiple_descriptions = stockcode_description_counts[stockcode_description_counts > 1].index.tolist()
# display(stockcodes_with_multiple_descriptions)
# Filter rows with StockCodes having more than one unique Description
multiple_description_stockcodes = stockcode_description_counts[stockcode_description_counts > 1].index
# Create a df containing StockCodes with multiple Descriptions
multiple_description_data = data[data['StockCode'].isin(multiple_description_stockcodes)]
# Group by 'StockCode' and collect unique 'Description' values
stockcode_descriptions_multiple = multiple_description_data.groupby('StockCode')['Description'].unique().reset_index()
# Display the result
display(stockcode_descriptions_multiple.head())
| | StockCode | Description |
|---|---|---|
| 0 | 10080 | [GROOVY CACTUS INFLATABLE, check] |
| 1 | 10120 | [DOGGY RUBBER, Zebra invcing error] |
| 2 | 10133 | [COLOURING PENCILS BROWN TUBE, damaged] |
| 3 | 15056N | [EDWARDIAN PARASOL NATURAL, wedding co returns?] |
| 4 | 15058A | [BLUE POLKADOT GARDEN PARASOL, wet/rusty, BLUE... |
# Iterate through StockCodes with multiple Descriptions
for stockcode in tqdm(stockcodes_with_multiple_descriptions):
# Get the first Description for the current StockCode
first_description = data[data['StockCode'] == stockcode]['Description'].dropna().iloc[0]
# Replace all Descriptions for the current StockCode with the first one
data.loc[data['StockCode'] == stockcode, 'Description'] = first_description
100%|███████████████████████████████████████| 1232/1232 [01:20<00:00, 15.26it/s]
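The same one-pass idea applies here: instead of looping, `transform('first')` can overwrite every row with its group's first Description. A sketch on hypothetical rows, equivalent to the loop above as long as the first non-null Description per StockCode is the intended canonical one:

```python
import pandas as pd

# Hypothetical mini-frame with conflicting descriptions per StockCode.
df = pd.DataFrame({
    'StockCode': ['A', 'A', 'B'],
    'Description': ['GROOVY CACTUS INFLATABLE', 'check', 'DOGGY RUBBER'],
})

# Overwrite every row with its group's first non-null Description in one pass.
df['Description'] = df.groupby('StockCode')['Description'].transform('first')
print(df['Description'].tolist())
```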
count_unique_values(data,['StockCode','Description'])
| | Column Name | Unique Values |
|---|---|---|
| 0 | StockCode | 5305 |
| 1 | Description | 4743 |
# Filter rows with StockCodes having more than one unique Description
multiple_description_stockcodes = stockcode_description_counts[stockcode_description_counts > 1].index
# Create a DataFrame containing StockCodes with multiple Descriptions
multiple_description_data = data[data['StockCode'].isin(multiple_description_stockcodes)]
# Group by 'StockCode' and collect unique 'Description' values
stockcode_descriptions_multiple = multiple_description_data.groupby('StockCode')['Description'].unique().reset_index()
# Display the result
display(stockcode_descriptions_multiple.head())
| | StockCode | Description |
|---|---|---|
| 0 | 10080 | [GROOVY CACTUS INFLATABLE] |
| 1 | 10120 | [DOGGY RUBBER] |
| 2 | 10133 | [COLOURING PENCILS BROWN TUBE] |
| 3 | 15056N | [EDWARDIAN PARASOL NATURAL] |
| 4 | 15058A | [BLUE POLKADOT GARDEN PARASOL] |
count_unique_values(data,['StockCode','Description'])
| | Column Name | Unique Values |
|---|---|---|
| 0 | StockCode | 5305 |
| 1 | Description | 4743 |
Some Descriptions now map to multiple StockCodes (e.g. upper- and lowercase variants of the same code), so these need to be cleaned as well.
# Group by 'Description' and collect unique 'StockCode' values
description_stockcodes_multiple = data.groupby('Description')['StockCode'].unique().reset_index()
# Filter rows where there are multiple unique StockCodes for a Description
description_stockcodes_multiple = description_stockcodes_multiple[description_stockcodes_multiple['StockCode'].apply(len) > 1]
# Display the result
display(description_stockcodes_multiple.head())
| | Description | StockCode |
|---|---|---|
| 49 | 3 GARDENIA MORRIS BOXED CANDLES | [85034A, 85034a] |
| 63 | 3 WHITE CHOC MORRIS BOXED CANDLES | [85034B, 85034b] |
| 74 | 3D DOG PICTURE PLAYING CARDS | [84558A, 84558a] |
| 76 | 3D SHEET OF CAT STICKERS | [84559B, 84559b] |
| 77 | 3D SHEET OF DOG STICKERS | [84559A, 84559a] |
# Iterate through rows with multiple StockCodes for a Description
for index, row in description_stockcodes_multiple.iterrows():
# Get the first StockCode for the current Description
first_stockcode = row['StockCode'][0]
# Replace all StockCodes for the current Description with the first one
data.loc[data['Description'] == row['Description'], 'StockCode'] = first_stockcode
# Display the result
description_stockcodes_multiple = data.groupby('Description')['StockCode'].unique().reset_index()
display(description_stockcodes_multiple.head())
| | Description | StockCode |
|---|---|---|
| 0 | 4 PURPLE FLOCK DINNER CANDLES | [72800B] |
| 1 | 50'S CHRISTMAS GIFT BAG LARGE | [23437] |
| 2 | DOLLY GIRL BEAKER | [23345] |
| 3 | HOME SWEET HOME BLACKBOARD | [21185] |
| 4 | I LOVE LONDON MINI BACKPACK | [23391] |
count_unique_values(data,['StockCode','Description'])
| | Column Name | Unique Values |
|---|---|---|
| 0 | StockCode | 5098 |
| 1 | Description | 4743 |
table_summary(data)
Table Summary Dataframe Shape: (1045545, 8)
| | Column Name | Unique Values | Empty Values | Percentage Empty | Data Type |
|---|---|---|---|---|---|
| 0 | InvoiceNo | 53628 | 0 | 0.000000 | object |
| 1 | StockCode | 5098 | 0 | 0.000000 | object |
| 2 | Description | 4743 | 363 | 0.034719 | object |
| 3 | Quantity | 1057 | 0 | 0.000000 | int64 |
| 4 | InvoiceDate | 47635 | 0 | 0.000000 | object |
| 5 | UnitPrice | 2807 | 0 | 0.000000 | float64 |
| 6 | CustomerID | 5942 | 235289 | 22.503957 | float64 |
| 7 | Country | 43 | 0 | 0.000000 | object |
There are still missing values in Description (rows whose StockCode never appears with a description), so I'll drop those rows.
data = data.dropna(subset=['Description'])
data['Description'] = data['Description'].str.title()
data.shape
(1045182, 8)
data['InvoiceDate'] = pd.to_datetime(data['InvoiceDate'])
# Extract Year, Quarter, Month, Week, and Day
data['Year'] = data['InvoiceDate'].dt.year
data['Quarter'] = data['InvoiceDate'].dt.quarter
data['Month'] = data['InvoiceDate'].dt.month
data['Week'] = data['InvoiceDate'].dt.isocalendar().week
data['Day'] = data['InvoiceDate'].dt.day
data['TotalPrice']=data['Quantity']*data['UnitPrice']
data['YearQuarter'] = data['InvoiceDate'].dt.to_period("Q")
data['YearMonth'] = data['InvoiceDate'].dt.to_period("M")
data['Date'] = data['InvoiceDate'].dt.date
data['DayOfWeek'] = data['InvoiceDate'].dt.day_name()
total_duplicates = data.duplicated(keep=False).sum()
# Display the total number of duplicate rows
print("Total Number of Duplicates:", total_duplicates)
Total Number of Duplicates: 24920
data=data.drop_duplicates(keep='first')
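Note the two calls count different things: `duplicated(keep=False)` flags every member of a duplicate group, while `drop_duplicates(keep='first')` removes only the extra copies, so fewer than 24,920 rows are actually dropped. A toy example (hypothetical rows) makes the difference concrete:

```python
import pandas as pd

# Toy frame (hypothetical) with one row repeated three times.
df = pd.DataFrame({'InvoiceNo': ['1', '1', '1', '2'], 'Quantity': [5, 5, 5, 9]})

flagged = df.duplicated(keep=False).sum()      # all 3 copies flagged
removed = len(df) - len(df.drop_duplicates())  # only the 2 extras removed
print(flagged, removed)  # 3 2
```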
Although approximately 22% of CustomerID values are missing, those transactions still contribute to revenue, so I will retain them for the analysis.
# For analysis purposes, fill null CustomerID values with the placeholder '00000'
# (which becomes 0 after the integer cast below) to keep the column complete.
data['CustomerID'] = data['CustomerID'].fillna('00000')
data['CustomerID'] = data['CustomerID'].astype('int64')
data.info()
<class 'pandas.core.frame.DataFrame'>
Index: 1031929 entries, 0 to 1045544
Data columns (total 18 columns):
 #   Column       Non-Null Count    Dtype
---  ------       --------------    -----
 0   InvoiceNo    1031929 non-null  object
 1   StockCode    1031929 non-null  object
 2   Description  1031929 non-null  object
 3   Quantity     1031929 non-null  int64
 4   InvoiceDate  1031929 non-null  datetime64[ns]
 5   UnitPrice    1031929 non-null  float64
 6   CustomerID   1031929 non-null  int64
 7   Country      1031929 non-null  object
 8   Year         1031929 non-null  int32
 9   Quarter      1031929 non-null  int32
 10  Month        1031929 non-null  int32
 11  Week         1031929 non-null  UInt32
 12  Day          1031929 non-null  int32
 13  TotalPrice   1031929 non-null  float64
 14  YearQuarter  1031929 non-null  period[Q-DEC]
 15  YearMonth    1031929 non-null  period[M]
 16  Date         1031929 non-null  object
 17  DayOfWeek    1031929 non-null  object
dtypes: UInt32(1), datetime64[ns](1), float64(2), int32(4), int64(2), object(6), period[M](1), period[Q-DEC](1)
memory usage: 130.9+ MB
data.describe(include='all')
| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country | Year | Quarter | Month | Week | Day | TotalPrice | YearQuarter | YearMonth | Date | DayOfWeek |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1031929 | 1031929 | 1031929 | 1.031929e+06 | 1031929 | 1.031929e+06 | 1.031929e+06 | 1031929 | 1.031929e+06 | 1.031929e+06 | 1.031929e+06 | 1031929.0 | 1.031929e+06 | 1.031929e+06 | 1031929 | 1031929 | 1031929 | 1031929 |
| unique | 53265 | 4743 | 4740 | NaN | NaN | NaN | NaN | 43 | NaN | NaN | NaN | <NA> | NaN | NaN | 9 | 25 | 604 | 7 |
| top | 573585 | 85123A | White Hanging Heart T-Light Holder | NaN | NaN | NaN | NaN | United Kingdom | NaN | NaN | NaN | <NA> | NaN | NaN | 2010Q4 | 2011-11 | 2011-12-05 | Thursday |
| freq | 1114 | 5752 | 5752 | NaN | NaN | NaN | NaN | 947263 | NaN | NaN | NaN | <NA> | NaN | NaN | 176778 | 83261 | 5290 | 195961 |
| mean | NaN | NaN | NaN | 1.009616e+01 | 2011-01-03 16:20:29.697507328 | 4.616598e+00 | 1.183130e+04 | NaN | 2.010436e+03 | 2.786267e+00 | 7.394530e+00 | 30.054866 | 1.534869e+01 | 1.825673e+01 | NaN | NaN | NaN | NaN |
| min | NaN | NaN | NaN | -8.099500e+04 | 2009-12-01 07:45:00 | -5.359436e+04 | 0.000000e+00 | NaN | 2.009000e+03 | 1.000000e+00 | 1.000000e+00 | 1.0 | 1.000000e+00 | -1.684696e+05 | NaN | NaN | NaN | NaN |
| 25% | NaN | NaN | NaN | 1.000000e+00 | 2010-07-05 12:40:00 | 1.250000e+00 | 1.253000e+04 | NaN | 2.010000e+03 | 2.000000e+00 | 4.000000e+00 | 17.0 | 8.000000e+00 | 3.750000e+00 | NaN | NaN | NaN | NaN |
| 50% | NaN | NaN | NaN | 3.000000e+00 | 2010-12-09 13:34:00 | 2.100000e+00 | 1.450600e+04 | NaN | 2.010000e+03 | 3.000000e+00 | 8.000000e+00 | 33.0 | 1.500000e+01 | 9.920000e+00 | NaN | NaN | NaN | NaN |
| 75% | NaN | NaN | NaN | 1.000000e+01 | 2011-07-27 13:15:00 | 4.150000e+00 | 1.636000e+04 | NaN | 2.011000e+03 | 4.000000e+00 | 1.100000e+01 | 44.0 | 2.300000e+01 | 1.770000e+01 | NaN | NaN | NaN | NaN |
| max | NaN | NaN | NaN | 8.099500e+04 | 2011-12-09 12:50:00 | 3.897000e+04 | 1.828700e+04 | NaN | 2.011000e+03 | 4.000000e+00 | 1.200000e+01 | 52.0 | 3.100000e+01 | 1.684696e+05 | NaN | NaN | NaN | NaN |
| std | NaN | NaN | NaN | 1.751800e+02 | NaN | 1.224631e+02 | 6.589141e+03 | NaN | 5.764536e-01 | 1.144788e+00 | 3.500319e+00 | 15.175768 | 8.642581e+00 | 2.958430e+02 | NaN | NaN | NaN | NaN |
negative_quantity=data[data['Quantity']<0]
negative_quantity.head()
| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country | Year | Quarter | Month | Week | Day | TotalPrice | YearQuarter | YearMonth | Date | DayOfWeek |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 141 | C536379 | D | Discount | -1 | 2010-12-01 09:41:00 | 27.50 | 14527 | United Kingdom | 2010 | 4 | 12 | 48 | 1 | -27.50 | 2010Q4 | 2010-12 | 2010-12-01 | Wednesday |
| 154 | C536383 | 35004C | Set Of 3 Coloured Flying Ducks | -1 | 2010-12-01 09:49:00 | 4.65 | 15311 | United Kingdom | 2010 | 4 | 12 | 48 | 1 | -4.65 | 2010Q4 | 2010-12 | 2010-12-01 | Wednesday |
| 235 | C536391 | 22556 | Plasters In Tin Circus Parade | -12 | 2010-12-01 10:24:00 | 1.65 | 17548 | United Kingdom | 2010 | 4 | 12 | 48 | 1 | -19.80 | 2010Q4 | 2010-12 | 2010-12-01 | Wednesday |
| 236 | C536391 | 21984 | Pack Of 12 Pink Paisley Tissues | -24 | 2010-12-01 10:24:00 | 0.29 | 17548 | United Kingdom | 2010 | 4 | 12 | 48 | 1 | -6.96 | 2010Q4 | 2010-12 | 2010-12-01 | Wednesday |
| 237 | C536391 | 21983 | Pack Of 12 Blue Paisley Tissues | -24 | 2010-12-01 10:24:00 | 0.29 | 17548 | United Kingdom | 2010 | 4 | 12 | 48 | 1 | -6.96 | 2010Q4 | 2010-12 | 2010-12-01 | Wednesday |
negative_quantity.shape
(22165, 18)
There are over 22k observations with negative quantities. Most correspond to cancelled orders (InvoiceNo prefixed with 'C'), but some negative quantities appear on invoices without a 'C' prefix.
cancelled=data[data['InvoiceNo'].astype(str).str.contains('C')]
filtered_data = data[~data['InvoiceNo'].astype(str).str.contains('C')]
display(cancelled.head())
| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country | Year | Quarter | Month | Week | Day | TotalPrice | YearQuarter | YearMonth | Date | DayOfWeek |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 141 | C536379 | D | Discount | -1 | 2010-12-01 09:41:00 | 27.50 | 14527 | United Kingdom | 2010 | 4 | 12 | 48 | 1 | -27.50 | 2010Q4 | 2010-12 | 2010-12-01 | Wednesday |
| 154 | C536383 | 35004C | Set Of 3 Coloured Flying Ducks | -1 | 2010-12-01 09:49:00 | 4.65 | 15311 | United Kingdom | 2010 | 4 | 12 | 48 | 1 | -4.65 | 2010Q4 | 2010-12 | 2010-12-01 | Wednesday |
| 235 | C536391 | 22556 | Plasters In Tin Circus Parade | -12 | 2010-12-01 10:24:00 | 1.65 | 17548 | United Kingdom | 2010 | 4 | 12 | 48 | 1 | -19.80 | 2010Q4 | 2010-12 | 2010-12-01 | Wednesday |
| 236 | C536391 | 21984 | Pack Of 12 Pink Paisley Tissues | -24 | 2010-12-01 10:24:00 | 0.29 | 17548 | United Kingdom | 2010 | 4 | 12 | 48 | 1 | -6.96 | 2010Q4 | 2010-12 | 2010-12-01 | Wednesday |
| 237 | C536391 | 21983 | Pack Of 12 Blue Paisley Tissues | -24 | 2010-12-01 10:24:00 | 0.29 | 17548 | United Kingdom | 2010 | 4 | 12 | 48 | 1 | -6.96 | 2010Q4 | 2010-12 | 2010-12-01 | Wednesday |
cancelled.shape
(19076, 18)
cancelled['Quantity'].max()
1
cancelled[cancelled['Quantity']>0]
| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country | Year | Quarter | Month | Week | Day | TotalPrice | YearQuarter | YearMonth | Date | DayOfWeek |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 619406 | C496350 | M | Manual | 1 | 2010-02-01 08:24:00 | 373.57 | 0 | United Kingdom | 2010 | 1 | 2 | 5 | 1 | 373.57 | 2010Q1 | 2010-02 | 2010-02-01 | Monday |
filtered_data[filtered_data['Quantity']<0].head()
| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country | Year | Quarter | Month | Week | Day | TotalPrice | YearQuarter | YearMonth | Date | DayOfWeek |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 2498 | 536589 | 21777 | Recipe Box With Metal Heart | -10 | 2010-12-01 16:50:00 | 0.0 | 0 | United Kingdom | 2010 | 4 | 12 | 48 | 1 | -0.0 | 2010Q4 | 2010-12 | 2010-12-01 | Wednesday |
| 4477 | 536764 | 84952C | Mirror Love Bird T-Light Holder | -38 | 2010-12-02 14:42:00 | 0.0 | 0 | United Kingdom | 2010 | 4 | 12 | 48 | 2 | -0.0 | 2010Q4 | 2010-12 | 2010-12-02 | Thursday |
| 7404 | 536996 | 22712 | Card Dolly Girl | -20 | 2010-12-03 15:30:00 | 0.0 | 0 | United Kingdom | 2010 | 4 | 12 | 48 | 3 | -0.0 | 2010Q4 | 2010-12 | 2010-12-03 | Friday |
| 7405 | 536997 | 22028 | Penny Farthing Birthday Card | -20 | 2010-12-03 15:30:00 | 0.0 | 0 | United Kingdom | 2010 | 4 | 12 | 48 | 3 | -0.0 | 2010Q4 | 2010-12 | 2010-12-03 | Friday |
| 7406 | 536998 | 85067 | Cream Sweetheart Wall Cabinet | -6 | 2010-12-03 15:30:00 | 0.0 | 0 | United Kingdom | 2010 | 4 | 12 | 48 | 3 | -0.0 | 2010Q4 | 2010-12 | 2010-12-03 | Friday |
filtered_data.shape
(1012853, 18)
filtered_data['UnitPrice'].nunique()
2280
filtered_data['InvoiceNo'].nunique()
44973
filtered_data.describe(include='all')
| | InvoiceNo | StockCode | Description | Quantity | InvoiceDate | UnitPrice | CustomerID | Country | Year | Quarter | Month | Week | Day | TotalPrice | YearQuarter | YearMonth | Date | DayOfWeek |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1012853 | 1012853 | 1012853 | 1.012853e+06 | 1012853 | 1.012853e+06 | 1.012853e+06 | 1012853 | 1.012853e+06 | 1.012853e+06 | 1.012853e+06 | 1012853.0 | 1.012853e+06 | 1.012853e+06 | 1012853 | 1012853 | 1012853 | 1012853 |
| unique | 44973 | 4737 | 4734 | NaN | NaN | NaN | NaN | 43 | NaN | NaN | NaN | <NA> | NaN | NaN | 9 | 25 | 604 | 7 |
| top | 573585 | 85123A | White Hanging Heart T-Light Holder | NaN | NaN | NaN | NaN | United Kingdom | NaN | NaN | NaN | <NA> | NaN | NaN | 2010Q4 | 2011-11 | 2011-12-05 | Thursday |
| freq | 1114 | 5618 | 5618 | NaN | NaN | NaN | NaN | 931000 | NaN | NaN | NaN | <NA> | NaN | NaN | 173831 | 82190 | 5264 | 191379 |
| mean | NaN | NaN | NaN | 1.075685e+01 | 2011-01-04 02:40:45.063932672 | 3.895506e+00 | 1.178207e+04 | NaN | 2.010436e+03 | 2.787781e+00 | 7.399237e+00 | 30.074897 | 1.534964e+01 | 2.004322e+01 | NaN | NaN | NaN | NaN |
| min | NaN | NaN | NaN | -9.600000e+03 | 2009-12-01 07:45:00 | -5.359436e+04 | 0.000000e+00 | NaN | 2.009000e+03 | 1.000000e+00 | 1.000000e+00 | 1.0 | 1.000000e+00 | -5.359436e+04 | NaN | NaN | NaN | NaN |
| 25% | NaN | NaN | NaN | 1.000000e+00 | 2010-07-05 16:48:00 | 1.250000e+00 | 1.249000e+04 | NaN | 2.010000e+03 | 2.000000e+00 | 4.000000e+00 | 17.0 | 8.000000e+00 | 3.900000e+00 | NaN | NaN | NaN | NaN |
| 50% | NaN | NaN | NaN | 3.000000e+00 | 2010-12-09 14:09:00 | 2.100000e+00 | 1.450500e+04 | NaN | 2.010000e+03 | 3.000000e+00 | 8.000000e+00 | 33.0 | 1.500000e+01 | 1.000000e+01 | NaN | NaN | NaN | NaN |
| 75% | NaN | NaN | NaN | 1.200000e+01 | 2011-07-27 15:16:00 | 4.130000e+00 | 1.636000e+04 | NaN | 2.011000e+03 | 4.000000e+00 | 1.100000e+01 | 44.0 | 2.300000e+01 | 1.770000e+01 | NaN | NaN | NaN | NaN |
| max | NaN | NaN | NaN | 8.099500e+04 | 2011-12-09 12:50:00 | 2.511109e+04 | 1.828700e+04 | NaN | 2.011000e+03 | 4.000000e+00 | 1.200000e+01 | 52.0 | 3.100000e+01 | 1.684696e+05 | NaN | NaN | NaN | NaN |
| std | NaN | NaN | NaN | 1.373152e+02 | NaN | 9.497313e+01 | 6.625575e+03 | NaN | 5.762176e-01 | 1.144708e+00 | 3.499990e+00 | 15.175175 | 8.642379e+00 | 2.204710e+02 | NaN | NaN | NaN | NaN |
The summary statistics show extreme outliers in the TotalPrice, Quantity, and UnitPrice columns.
fig = px.box(filtered_data, x='TotalPrice')
fig.show()
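Beyond eyeballing the box plot, the outliers can be counted with Tukey's 1.5×IQR fences (the rule box plots themselves use for whiskers). A minimal sketch on hypothetical TotalPrice values; the `iqr_outlier_mask` helper is illustrative, not part of the notebook:

```python
import pandas as pd

def iqr_outlier_mask(s: pd.Series, k: float = 1.5) -> pd.Series:
    """Flag values outside the Tukey fences [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    return (s < q1 - k * iqr) | (s > q3 + k * iqr)

# Hypothetical TotalPrice values with one extreme entry.
prices = pd.Series([3.9, 10.0, 17.7, 15.0, 8.0, 168469.6])
mask = iqr_outlier_mask(prices)
print(mask.sum())  # 1
```

On heavily skewed retail data these fences flag many legitimate bulk orders, so the mask is better used to inspect candidates than to drop rows automatically.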